Goto

Collaborating Authors

 stochastic gradient mcmc


Stochastic Gradient MCMC with Stale Gradients

Neural Information Processing Systems

Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers.


A Complete Recipe for Stochastic Gradient MCMC

Neural Information Processing Systems

Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered since proving convergence in the presence of the stochastic gradient noise is non-trivial. Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers--including stochastic gradient versions--based on continuous Markov processes specified via two matrices.


Reviews: Stochastic Gradient MCMC with Stale Gradients

Neural Information Processing Systems

Technical quality: I think that the theory is very complete (bounds are given for pretty much everything relevant to the problem), and the experiments show that this method performs better on large/complicated models (the small/simple models have too little variance for extra servers to help, and the staleness prevents much benefits). I think the biggest limitation of the paper is the lack of comparison against the method in [14] (the paper mostly compares against the non-distributed -- 1 worker -- case, instead of a more standard distributed case). Novelty/originality: My impression is theoretical results are mostly a combination of proof techniques used in other SG-MCMC and asynchronous SGD papers (however, I'm not too sure that this claim is correct). Assuming this is true, I think the results are well-executed, but not too unique. Potential impact or usefulness: I think the theoretical analysis will be useful for people interested in how asynchrony affects SG-MCMC. However, I'm not too clear how much this will help for running SG-MCMC in practice.


Learning to Explore for Stochastic Gradient MCMC

Kim, SeungHyun, Jung, Seohyeon, Kim, Seonghyeon, Lee, Juho

arXiv.org Artificial Intelligence

Bayesian Neural Networks(BNNs) with high-dimensional parameters pose a challenge for posterior inference due to the multi-modality of the posterior distributions. Stochastic Gradient MCMC(SGMCMC) with cyclical learning rate scheduling is a promising solution, but it requires a large number of sampling steps to explore high-dimensional multi-modal posteriors, making it computationally expensive. In this paper, we propose a meta-learning strategy to build \gls{sgmcmc} which can efficiently explore the multi-modal target distributions. Our algorithm allows the learned SGMCMC to quickly explore the high-density region of the posterior landscape. Also, we show that this exploration property is transferrable to various tasks, even for the ones unseen during a meta-training stage. Using popular image classification benchmarks and a variety of downstream tasks, we demonstrate that our method significantly improves the sampling efficiency, achieving better performance than vanilla \gls{sgmcmc} without incurring significant computational overhead.

  learning, stochastic gradient mcmc
2408.0914

Scaling up Dynamic Edge Partition Models via Stochastic Gradient MCMC

Yang, Sikun, Koeppl, Heinz

arXiv.org Artificial Intelligence

The edge partition model (EPM) is a generative model for extracting an overlapping community structure from static graph-structured data. In the EPM, the gamma process (GaP) prior is adopted to infer the appropriate number of latent communities, and each vertex is endowed with a gamma distributed positive memberships vector. Despite having many attractive properties, inference in the EPM is typically performed using Markov chain Monte Carlo (MCMC) methods that prevent it from being applied to massive network data. In this paper, we generalize the EPM to account for dynamic enviroment by representing each vertex with a positive memberships vector constructed using Dirichlet prior specification, and capturing the time-evolving behaviour of vertices via a Dirichlet Markov chain construction. A simple-to-implement Gibbs sampler is proposed to perform posterior computation using Negative- Binomial augmentation technique. For large network data, we propose a stochastic gradient Markov chain Monte Carlo (SG-MCMC) algorithm for scalable inference in the proposed model. The experimental results show that the novel methods achieve competitive performance in terms of link prediction, while being much faster.


AdamMCMC: Combining Metropolis Adjusted Langevin with Momentum-based Optimization

Bieringer, Sebastian, Kasieczka, Gregor, Steffen, Maximilian F., Trabs, Mathias

arXiv.org Machine Learning

Uncertainty estimation is a key issue when considering the application of deep neural network methods in science and engineering. In this work, we introduce a novel algorithm that quantifies epistemic uncertainty via Monte Carlo sampling from a tempered posterior distribution. It combines the well established Metropolis Adjusted Langevin Algorithm (MALA) with momentum-based optimization using Adam and leverages a prolate proposal distribution, to efficiently draw from the posterior. We prove that the constructed chain admits the Gibbs posterior as an invariant distribution and converges to this Gibbs posterior in total variation distance. Numerical evaluations are postponed to a first revision.


Non-convex Bayesian Learning via Stochastic Gradient Markov Chain Monte Carlo

Deng, Wei

arXiv.org Artificial Intelligence

The rise of artificial intelligence (AI) hinges on the efficient training of modern deep neural networks (DNNs) for non-convex optimization and uncertainty quantification, which boils down to a non-convex Bayesian learning problem. A standard tool to handle the problem is Langevin Monte Carlo, which proposes to approximate the posterior distribution with theoretical guarantees. In this thesis, we start with the replica exchange Langevin Monte Carlo (also known as parallel tempering), which proposes appropriate swaps between exploration and exploitation to achieve accelerations. However, the na\"ive extension of swaps to big data problems leads to a large bias, and bias-corrected swaps are required. Such a mechanism leads to few effective swaps and insignificant accelerations. To alleviate this issue, we first propose a control variates method to reduce the variance of noisy energy estimators and show a potential to accelerate the exponential convergence. We also present the population-chain replica exchange based on non-reversibility and obtain an optimal round-trip rate for deep learning. In the second part of the thesis, we study scalable dynamic importance sampling algorithms based on stochastic approximation. Traditional dynamic importance sampling algorithms have achieved success, however, the lack of scalability has greatly limited their extensions to big data. To handle this scalability issue, we resolve the vanishing gradient problem and propose two dynamic importance sampling algorithms. Theoretically, we establish the stability condition for the underlying ordinary differential equation (ODE) system and guarantee the asymptotic convergence of the latent variable to the desired fixed point. Interestingly, such a result still holds given non-convex energy landscapes.


Contributions to Large Scale Bayesian Inference and Adversarial Machine Learning

Gallego, Víctor

arXiv.org Machine Learning

The rampant adoption of ML methodologies has revealed that models are usually adopted to make decisions without taking into account the uncertainties in their predictions. More critically, they can be vulnerable to adversarial examples. Thus, we believe that developing ML systems that take into account predictive uncertainties and are robust against adversarial examples is a must for critical, real-world tasks. We start with a case study in retailing. We propose a robust implementation of the Nerlove-Arrow model using a Bayesian structural time series model. Its Bayesian nature facilitates incorporating prior information reflecting the manager's views, which can be updated with relevant data. However, this case adopted classical Bayesian techniques, such as the Gibbs sampler. Nowadays, the ML landscape is pervaded with neural networks and this chapter also surveys current developments in this sub-field. Then, we tackle the problem of scaling Bayesian inference to complex models and large data regimes. In the first part, we propose a unifying view of two different Bayesian inference algorithms, Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) and Stein Variational Gradient Descent (SVGD), leading to improved and efficient novel sampling schemes. In the second part, we develop a framework to boost the efficiency of Bayesian inference in probabilistic models by embedding a Markov chain sampler within a variational posterior approximation. After that, we present an alternative perspective on adversarial classification based on adversarial risk analysis, and leveraging the scalable Bayesian approaches from chapter 2. In chapter 4 we turn to reinforcement learning, introducing Threatened Markov Decision Processes, showing the benefits of accounting for adversaries in RL while the agent learns.


Stochastic Gradient MCMC with Multi-Armed Bandit Tuning

Coullon, Jeremie, South, Leah, Nemeth, Christopher

arXiv.org Machine Learning

Most MCMC algorithms contain user-controlled hyperparameters which need to be carefully selected to ensure that the MCMC algorithm explores the posterior distribution efficiently. Optimal tuning rates for many popular MCMC algorithms such the random-walk (Gelman et al., 1997) or Metropolis-adjusted Langevin algorithms (Roberts and Rosenthal, 1998) rely on setting the tuning parameters according to the Metropolis-Hastings acceptance rate. Using metrics such as the acceptance rate, hyperparameters can be optimized on-the-fly within the MCMC algorithm using adaptive MCMC (Andrieu and Thoms, 2008; Vihola, 2012). However, in the context of stochastic gradient MCMC (SGMCMC), there is no acceptance rate to tune against and the trade-off between bias and variance for a fixed computational budget means that tuning approaches designed for target invariant MCMC algorithms are not applicable. Related work Previous adaptive SGMCMC algorithms have focused on embedding ideas from the optimization literature within the SGMCMC framework, e.g.


Targeted stochastic gradient Markov chain Monte Carlo for hidden Markov models with rare latent states

Ou, Rihui, Sen, Deborshee, Young, Alexander L, Dunson, David B

arXiv.org Machine Learning

Markov chain Monte Carlo (MCMC) algorithms for hidden Markov models often rely on the forward-backward sampler. This makes them computationally slow as the length of the time series increases, motivating the recent development of sub-sampling-based approaches. These approximate the full posterior by using small random subsequences of the data at each MCMC iteration within stochastic gradient MCMC. In the presence of imbalanced data resulting from rare latent states, subsequences often exclude rare latent state data, leading to inaccurate inference and prediction/detection of rare events. We propose a targeted sub-sampling (TASS) approach that over-samples observations corresponding to rare latent states when calculating the stochastic gradient of parameters associated with them. TASS uses an initial clustering of the data to construct subsequence weights that reduce the variance in gradient estimation. This leads to improved sampling efficiency, in particular in settings where the rare latent states correspond to extreme observations. We demonstrate substantial gains in predictive and inferential accuracy on real and synthetic examples.